ABSTRACT: A brief discussion is given of the traditional version of the Maximum
Entropy Method, including a review of some of the criticism that has been made in regard 
to its use in statistical inference. Motivated by these questions, a modified version of the 
method is then proposed and applied to an example in order to demonstrate its use with a 
given time series. 
1. Historical Background

The concept of entropy has gone through several distinct stages since its inception in the
19th century. Originally formulated as a thermodynamic potential, it was re-expressed by
Boltzmann as a measure of disorder. The connection between thermodynamics and disorder, or
randomness, is given by Boltzmann's ansatz $S = \log(W)$, where $W$ is the count of accessible
configurations

$$ W = \frac{N!}{\prod_i n_i!} $$

and which leads to the expression

$$ S = -\sum_i p_i \log(p_i) $$

where $p_i$ is the Boltzmann factor for the $i$th energy state. In the 20th century, through the
work of Shannon [1] and Wiener [2], a second application was developed where this same
expression came to be used as a measure of the amount of information contained in a string
of characters, each of frequency or probability $p_i$.

The conceptual association of disorder and information as opposites was a natural one 
to make, but there is considerable disagreement as to how these two ideas, thermodynamic 
entropy and information, are related. Beginning with Szilard [3], a direct connection between 
entropy decrease and information gain was made suggesting that the two differ only by a 
sign. Recently Zurek [4] has proposed a more complicated relation suggesting that gain in 
information is not entirely reflected in the loss in entropy. We will come back to the question
of information and entropy later.
2a. Maximum Entropy and Inference 

Beginning with his paper of 1957 [5], E. T. Jaynes presented a third use for the expression for
entropy, employing it as a tool for statistical inference. Consider the problem of constructing
the probability distribution of a system when we have obtained data/measurements in the
form of a set of M numerical values $\{x_i\}$. With the above definition of entropy and a
constraint on the extremum of the form

$$ m_1 = \sum_i p_i x_i $$

(where $m_1$ is the first moment), the full entropy might be written

$$ S = -\sum_i p_i \log(p_i) + \lambda \sum_i p_i x_i $$

with $\lambda$ the Lagrange multiplier, and calculating $\delta S = 0$ we obtain

$$ p_i = \frac{e^{\lambda x_i}}{Z}, \qquad Z = \sum_i e^{\lambda x_i}. $$

The undetermined multiplier is obtained from

$$ m_1 = \frac{1}{Z} \sum_i x_i e^{\lambda x_i}. \qquad (1) $$
This set of formulae is directly analogous to the steps taken in statistical mechanics that
relate temperature to the multiplier in equilibrium physics. The algorithm for constructing
the distribution from data is apparently straightforward, but as it turns out, a number of
critics over the years have questioned Jaynes' formalism.
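The steps leading to equation 1 can be sketched numerically. The following is a minimal illustration, not part of the original formalism: the function names (`maxent_probs`, `mean_for`, `solve_lambda`) and the search bracket for $\lambda$ are assumptions made here, and bisection is simply one convenient way to invert equation 1, relying on the fact that the mean is an increasing function of $\lambda$.

```python
import math


def maxent_probs(xs, lam):
    """Return p_i = exp(lam * x_i) / Z for the values xs."""
    exponents = [lam * x for x in xs]
    shift = max(exponents)  # subtract the largest exponent to avoid overflow
    weights = [math.exp(e - shift) for e in exponents]
    z = sum(weights)
    return [w / z for w in weights]


def mean_for(xs, lam):
    """First moment sum_i p_i x_i of the distribution above."""
    return sum(p * x for p, x in zip(maxent_probs(xs, lam), xs))


def solve_lambda(xs, m1, lo=-50.0, hi=50.0, tol=1e-12):
    """Invert equation 1 by bisection (the bracket [lo, hi] is an assumption)."""
    if not min(xs) < m1 < max(xs):
        # At the end points the multiplier diverges and no finite root exists.
        raise ValueError("m1 must lie strictly between min(xs) and max(xs)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_for(xs, mid) < m1:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam = solve_lambda([1, 2, 3], 2.5)
print(lam, maxent_probs([1, 2, 3], lam))
```

Note that the solver refuses a mean equal to the smallest or largest value, since equation 1 then has no finite solution.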
2b. Criticism of the Maximum Entropy Formulation 

In the traditional ME formulation a strict analogy is made between the microscopic
presentation of statistical mechanics, of which the Boltzmann factor is the principal result,
and statistical inference. As Jaynes would have it, this is exactly what one must do for
statistical inference; given a time series whose average is known, one must maximize

$$ S = -\sum_i p_i \log(p_i) + \lambda \sum_i p_i x_i $$

where the time series of values $x_j$ has been replaced by its tabulated form

$$ \frac{1}{M} \sum_{j=1}^{M} x_j = \frac{1}{M} \sum_{i=1}^{N} n_i x_i. \qquad (2) $$

In this expression, the M observed values of $x_j$ are tabulated and then expressed as a sum
of $n_1$ occurrences of $x_1$, $n_2$ occurrences of $x_2$, $n_3$ occurrences of $x_3$, .... There are several things
wrong with this:

1. It is necessary to make a replacement of frequencies
with probabilities as in eq 2, which contradicts the position that
probabilities need not always be associated with frequencies.

2. In the variation of S in statistical mechanics, the constraint is
made up of variables $n_i$ over which the variation is taken;
in ME, the constraint is made up of data values,
which are not free to be varied.

3. There have been several additional criticisms of the ME method
beginning with Friedman and Shimony [4,5,6,7]. Good reviews of
these questions can be found in Uffink [7]. In particular,
Friedman and Shimony find an inconsistency in ME with respect to its
Bayesian properties. It will be shown below that the ME formalism will
not always reproduce the mean value of the data correctly, and will
generally produce a different standard deviation.

2c. Modified Maximum Entropy 

If we confine the theory to time series data only, a corrected version of ME is easily
obtained. Consider the following prescription for statistical inference. Given a time series
$x_j$ of M values we consider the total number of configurations as

$$ W = \frac{N!}{\prod_i n_i!} $$

and then weight this product by an extra factor

$$ W\,Q = \frac{N!}{\prod_i n_i!}\, Q $$

where

$$ Q = p^{m_1}(x_1)\, p^{m_2}(x_2) \cdots p^{m_N}(x_N) $$

which is the probability of a particular string of observations $\{x_j\}$. That is, $m_i$ is the number
of times $x_i$ has appeared in the time series. Then the maximization leads to

$$ \delta S = \sum_i \delta S_i = 0 $$

where each contribution to the sum is of the form

$$ S_i = -p_i \log(p_i) + m_i \log(p_i) $$

and where the first term (the entropy part) arises from the count of configurations W, and
the last term (the informational part) from the weight Q. The maximization procedure
results in an expression for the probability (disregarding normalization)

$$ p_i = \exp\left(\frac{m_i}{p_i}\right) \qquad (3) $$

which may be solved iteratively for $p_i$, yielding

$$ p(n) = \frac{n}{Z(n)} $$

where

$$ Z(n) = \log(n) - \log(\log(n)) + O\left(\frac{\log(\log(n))}{\log(n)}\right). $$
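Equation 3 is equivalent to $p \log(p) = n$, so that $Z(n) = n/p(n)$ satisfies $Z e^{Z} = n$; that is, $Z(n)$ is the principal branch of the Lambert W function, whose large-$n$ expansion is the one quoted above. As a minimal sketch, the iteration can be carried out with Newton's method. The function names `p_of_n` and `p_asymptotic`, the starting point and the tolerance are choices made here for illustration, not taken from the text.

```python
import math


def p_of_n(n, tol=1e-12, max_iter=100):
    """Solve p = exp(n / p), i.e. p * log(p) = n, for p >= 1 (equation 3)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Start at n + 1, which lies at or to the right of the root; the function
    # p * log(p) - n is convex and increasing there, so Newton steps move
    # monotonically towards the root.
    p = n + 1.0
    for _ in range(max_iter):
        step = (p * math.log(p) - n) / (math.log(p) + 1.0)
        p -= step
        if abs(step) <= tol * p:
            break
    return p


def p_asymptotic(n):
    """Leading terms n / (log n - log log n); meaningful only for large n."""
    return n / (math.log(n) - math.log(math.log(n)))


for n in (0, 1, 10, 1000, 10**6):
    exact = p_of_n(n)
    approx = p_asymptotic(n) if n > 10 else float("nan")
    print(n, exact, approx)
```

With no observations the solver returns $p(0) = 1$, so two unobserved outcomes receive equal weight after normalization.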



Generally, in the case that successive elements of the time series cannot be treated as
independent, we must use the above equation to solve for $p(n_1, n_2)$, $p(n_1, n_2, n_3)$, etc., depending
on the nature of the correlations in the time series. The evaluation of W in this case is
somewhat more involved, but that of Q remains the same.



3. Example Application 
3a. Traditional ME 

One of the obvious failings of ME is that it does not reflect the different contribution
made by a short time series versus that of a long one. In particular, if the data consisted
of a single coin toss, there does not appear to be a formal solution of equation 1 for the
Lagrange multiplier. The following is an example of this difficulty (see also
Uffink 1996).

Suppose we have a three-sided coin with $x_i = 1, 2, 3$ and toss it several times with the resulting
mean $m_1$. These are the first few elements in a time series and ought to provide a first estimate of the
probabilities of the three sides. According to the above algorithm, the Lagrange multiplier
for this problem is determined by

$$ m_1 = \frac{x_1 q + x_2 q^2 + x_3 q^3}{q + q^2 + q^3} $$

where $q = e^{\lambda}$. The prescription is to solve for $\lambda$, that is

$$ (m_1 - x_1) q + (m_1 - x_2) q^2 + (m_1 - x_3) q^3 = 0. \qquad (4) $$

A plot of the largest real positive root is given in Figure 1. If the time series contains only a
few elements, it is possible that the average turns out to be $m_1 = 1$ or $m_1 = 3$, yet according
to the plot this requires $q = 0$ or $q = \infty$, i.e. no disorder regardless of the data.
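For the three-sided coin, dividing equation 4 by $q$ leaves a quadratic, so the root can be written in closed form. The sketch below is an illustration under that reading of equation 4; the names `q_root` and `face_probs` are hypothetical, and the end points $m_1 = 1$ and $m_1 = 3$ are handled as the limiting cases $q = 0$ and $q = \infty$ described in the text.

```python
import math


def q_root(m1):
    """Positive root q = exp(lambda) of equation 4 for faces x = 1, 2, 3."""
    if not 1.0 <= m1 <= 3.0:
        raise ValueError("the mean of faces 1, 2, 3 must lie in [1, 3]")
    if m1 == 1.0:
        return 0.0       # limiting case: all weight on face 1
    if m1 == 3.0:
        return math.inf  # limiting case: all weight on face 3
    # Equation 4 divided by q: a*q**2 + b*q + c = 0
    a, b, c = m1 - 3.0, m1 - 2.0, m1 - 1.0
    disc = b * b - 4.0 * a * c  # positive because a < 0 < c
    return (-b - math.sqrt(disc)) / (2.0 * a)


def face_probs(m1):
    """Probabilities q**k / (q + q**2 + q**3) for k = 1, 2, 3."""
    q = q_root(m1)
    if q == 0.0:
        return (1.0, 0.0, 0.0)
    if math.isinf(q):
        return (0.0, 0.0, 1.0)
    z = q + q**2 + q**3
    return (q / z, q**2 / z, q**3 / z)


for m1 in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(m1, q_root(m1), face_probs(m1))
```

A short series whose mean happens to equal 1 or 3 therefore yields a distribution with all weight on one face, however few tosses were made.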



Figure 1: the root of equation 4 with respect to mean value 



This difficulty extends to the moments in general. For example, for a two-level system
with "energies" $\epsilon_1, \epsilon_2$, in Jaynes' formulation, moments are determined from data

$$ m_l = \frac{1}{M} \sum_j x_j^l $$

and these in turn are related to the Lagrange multipliers by

$$ m_l = \frac{1}{Z} \left[ \epsilon_1^l e^{\phi_1} + \epsilon_2^l e^{\phi_2} \right] $$

with

$$ Z = e^{\phi_1} + e^{\phi_2} $$

where the multipliers are given by

$$ \phi_i = \lambda_1 \epsilon_i + \lambda_2 \epsilon_i^2 + \lambda_3 \epsilon_i^3 + \ldots $$

For the two-level system, this works out to be

$$ \frac{m_l - \epsilon_1^l}{m_l - \epsilon_2^l} = -e^{\phi_2 - \phi_1} $$

for all $l$, which is overdetermined in general and so has no solution.
3b. Modified ME

In contrast the modified method yields probability values for any length of time series,
and in the limit of a large string, gives the same value as would be obtained from a direct
frequency tabulation. As a demonstration of how this operates, consider the toss of a
two-sided coin, using a possibly biased coin. After $N = n_h + n_t$ tosses the probability for each
side is given by

$$ p_h = \frac{p(n_h)}{p(n_h) + p(n_t)}, \qquad p_t = \frac{p(n_t)}{p(n_h) + p(n_t)}, $$

where $p(n_h)$ is the solution to equation 3 above.

Normalization of the probabilities is given by

$$ Z = p(n_h) + p(n_t). $$

The case similar to the one described above by Uffink is where a single toss results in a head,
for which the normalization is

$$ Z = p(1) + p(0). $$

In Figure 2 we plot the probability of heads and tails respectively, in a simulation where the
coin is biased such that in the infinite limit we should have $p_h = .7$ and $p_t = .3$. As can be
seen the initial values are rough, while in the limit of large data the values go over into the
frequency values.

Figure 2: the probabilities for the biased coin as a function of the number of tosses
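The two-sided coin calculation can be sketched as follows, assuming equation 3 is read as $p \log(p) = n$ and solved by Newton's method. The names `p_of_n` and `coin_probs` are hypothetical, and the counts in the loop are idealized 70/30 splits rather than the simulated tosses behind Figure 2.

```python
import math


def p_of_n(n, tol=1e-12, max_iter=100):
    """Solve p * log(p) = n for p >= 1 (equation 3) by Newton's method."""
    p = n + 1.0  # at or to the right of the root, where Newton is monotone
    for _ in range(max_iter):
        step = (p * math.log(p) - n) / (math.log(p) + 1.0)
        p -= step
        if abs(step) <= tol * p:
            break
    return p


def coin_probs(n_h, n_t):
    """Normalized (p_h, p_t) after n_h heads and n_t tails."""
    w_h, w_t = p_of_n(n_h), p_of_n(n_t)
    z = w_h + w_t  # normalization Z = p(n_h) + p(n_t)
    return w_h / z, w_t / z


print(coin_probs(0, 0))  # no data
print(coin_probs(1, 0))  # a single toss giving a head
for n in (10, 100, 1000, 10000):
    n_h = round(0.7 * n)  # idealized counts for a coin with p_h = .7
    print(n, coin_probs(n_h, n - n_h))
```

In this sketch a single head moves the estimate away from one half without jumping to certainty, and the approach to the frequency value .7 is slow, since the two weights differ from the raw counts by logarithmic factors.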

Returning now to the problem discussed above, consider a 2-sided coin that can only land
heads up, such that successive throws yield an increasingly long string of heads. Regardless
of the string length, the traditional ME can only give the solution $(p_h, p_t) = (1, 0)$, as shown
in Section 3a, while the modified version gives $p_t = 1 - p_h$ and

$$ p_h = \frac{p(n)}{1 + p(n)} $$

after the $n$th throw. This is plotted in Figure 3 and provides a more realistic description of
how an experimenter might gradually, but inevitably, come to the conclusion that the toss
is not random.
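As a short worked step, the rate at which the estimate approaches certainty can be read off from the expressions already given. This is a leading-order sketch only, assuming $p(0) = 1$ for the unobserved side (equation 3 with no occurrences) and $p(n) = n/Z(n)$ with $Z(n) \approx \log(n)$ for large $n$:

```latex
p_t = 1 - p_h = \frac{1}{1 + p(n)}
    \approx \frac{1}{p(n)}
    = \frac{Z(n)}{n}
    \approx \frac{\log(n)}{n}
    \qquad (n \to \infty).
```

Under these assumptions the probability assigned to tails decays slightly more slowly than $1/n$ and never reaches zero for any finite string of heads.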
Figure 3: probability of heads for the completely biased coin



4. Discussion 

In summary we have proposed that the traditional way of applying the Maximum
Entropy method, which involves maximization of

$$ S = -\sum_i p_i \log(p_i) + \lambda \sum_i p_i x_i $$

must be modified to

$$ S = -\sum_i p_i \log(p_i) + \sum_i m_i \log(p_i) $$

when dealing with time series. This allows the elements of a time series, no matter how
few in number, to be included and hence influence the resulting distribution. This method
has several advantages, most notably that the elements of the time series can be put into
the computation without difficulty and without complication whether the series is short or
long. Also, as the series becomes very long, the probability values become equal to those
found from direct frequency tabulation. One undeniable disadvantage is that the resulting
distribution is not usually a smooth function of the data. That is, for a short series, a small
number of new data points may make a considerable change in the numerical values of the $p_i$.
This is in contrast with the traditional ME approach, but perhaps is a more realistic property
of statistical inference.

The modified expression is made up of an entropy component and an informational one.
For short time series with little or no data, the entropy contribution dominates, starting
off with the equal probability distribution when no data at all is at hand. Then, as data
is accumulated (the longer time series), the informational part dominates, and it is to be
expected that the manner of cross-over from one to the other should depend on the nature
of the data. The interpretation of this is easily seen by considering the example of the
completely biased coin where $p_h = 1$. The information obtained from a single toss is given
by $p_h \log(p_h) + p_t \log(p_t) = 0$; this makes sense as the outcome is already known. However,
we might ask how much information had been acquired in determining the distribution
originally. The results obtained here suggest that the additional term in the expression for
entropy provides a means of fixing a value to the information contained in the statement
of the distribution. These ideas are somewhat in line with earlier studies (Zurek [4], Lin
[10]), though the context of this added term is different from these papers. In order to
find how much information is already contained in the knowledge of the distribution, we
must consider the manner in which the distribution was obtained. Initially we might start
with $p_h = p_t = 1/2$, and begin tossing, keeping record of the results. Each toss modifies $p_h$
and $p_t$ according to the function $p(n)$ given in equation 3. As the number of tosses grows
indefinitely, the amount of information contained in the data is given by the limiting value
shown above, $-n \log(p)$, which for the completely biased coin leads to $\log(n)$ for large $n$.
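The limiting behaviour can be checked numerically. The sketch below assumes that $p$ in $-n \log(p)$ is the normalized probability of heads, $p_h = p(n)/(1 + p(n))$, for the completely biased coin; the function names `p_of_n` and `information_all_heads` are hypothetical.

```python
import math


def p_of_n(n, tol=1e-12, max_iter=100):
    """Solve p * log(p) = n for p >= 1 (equation 3) by Newton's method."""
    p = n + 1.0  # at or to the right of the root, where Newton is monotone
    for _ in range(max_iter):
        step = (p * math.log(p) - n) / (math.log(p) + 1.0)
        p -= step
        if abs(step) <= tol * p:
            break
    return p


def information_all_heads(n):
    """-n * log(p_h) after n heads and no tails, with p_h = p / (1 + p)."""
    p = p_of_n(n)
    # -log(p / (1 + p)) = log(1 + 1/p); log1p keeps precision for large p
    return n * math.log1p(1.0 / p)


for n in (10, 10**3, 10**6):
    print(n, information_all_heads(n), math.log(n))
```

Under this assumption the quantity grows like $\log(n)$ only to leading order: it tracks $Z(n)$, so the $\log(\log(n))$ correction keeps it somewhat below $\log(n)$ at any practical series length.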